Neurocomputational mechanisms involved in adaptation to fluctuating intentions of others

Humans frequently interact with agents whose intentions can fluctuate between competition and cooperation over time. It is unclear how the brain adapts to the fluctuating intentions of others when the nature of the interaction (to cooperate or to compete) is not explicitly and truthfully signaled. Here, we use model-based fMRI and a task in which participants thought they were playing with another player. In fact, they played with an algorithm that alternated, without signaling, between cooperative and competitive strategies. We show that a neurocomputational mechanism with arbitration between competitive and cooperative experts outperforms other learning models in predicting choice behavior. At the brain level, the fMRI results show that the ventral striatum and ventromedial prefrontal cortex track the difference in reliability between these experts. When attributing competitive intentions, we find increased coupling between these regions and a network that distinguishes prediction errors related to competition from those related to cooperation. These findings provide a neurocomputational account of how the brain dynamically arbitrates between cooperative and competitive intentions when making adaptive social decisions.

Instructions were originally in French. Below, we provide both the English and French versions.
Deliberately, we never used words synonymous with "against" or "partner" since such vocabulary could influence the participant's prior with respect to the goal of the other.
Instructions (English): You are going to be faced with 4 cards: two face down, those of the person you will be interacting with, and two face up, yours. On each turn you will have to choose one of your 2 cards.
When the other player has chosen her card, the card will be placed in the middle face down, without you being able to know which one was chosen. When both of you have made your choices, you will see the card that the other player chose, and a one-euro coin if you win or a crossed-out one-euro coin if you lose. You win if both of you chose the same color card; otherwise you lose. You do not know the rules of the game for the person you are interacting with, and you do not know which reward she will receive. There will be around 150 trials to perform in the scanner. There will be a one-minute break halfway through. The person with whom you interact will not change between the 2 blocks. Every time you win, you will receive an extra 10 cents as a reward.

Debriefing of participants
We debriefed participants and asked the following 3 questions: Did you notice any changes in the way the other player played? If so, which ones? How did you notice it? At least 16 participants had a sense that the AA was switching between two different strategies (some participants did not answer anything specific to these questions). Below, we translate these debriefing reports:
1. The other player was losing money as he tried to do the "long sequence" technique, but it backfired (N.B.: the participant seems to think he is playing competitively). However, sometimes he tried to avoid me.
2. He changes his strategy. In the second game, it was tighter, he understood my strategy.
3. Sometimes the other player would change a move when they were both winning, it looked like he was bored. It looked like the other player was changing.
4. The other had two strategies: either he stayed on the same card while losing, or he changed every other time or tried to change the sequence.
5. Sometimes he would make series of up to 6 times the same card, sometimes he would alternate every two or three choices.
6. What worked at one moment did not work afterwards [...]. We found a logic and then suddenly it didn't work anymore.
7. Sometimes the other would always choose the same card for a long time, then he would vary, then choose the other card all the time.
8. The other person changes his strategy, his intention. At first it looked like a teammate, then an opponent.
9. The other was not consistent. Sometimes he was consistent, and sometimes he was not consistent.
10. Several changes. When I had the right result, he would change. He would make me feel confident and then change the strategy.
11. He adapted, he made changes in strategy like me.
12. He tried to adapt different strategies by repeating choices or alternating.
13. I noticed changes of rhythm. Sometimes the opponent would follow me and then stop following me. Maybe he was changing the instructions.
14. He would change sequences. Sometimes he would choose 7 times the black and then alternate once of each color [...].
15. At first the other had the same rules, then he changed his strategy (or did he have reasons to do so?), then sometimes he would go back to the same rules so that we would synchronize again.
16. Did we have the same goal? At first the other seemed to be cooperative. Then I realised that he didn't, I said to myself that we didn't have the same objective. Sometimes we would get into the same rhythm and then it would stop [...]. The other would do quite long sequences choosing the same colour, then he would change after 2 or 3 repetitions.

Specification of the Artificial agent algorithm
The artificial agent (AA) selected its target according to the probability that the player chooses a specific color after a given history. It stored the frequency with which the participant chose each target after each possible history of four elements, composed of the last two choices and the last two outcomes (see Table S1).
We call $p$ the probability that the player chooses the black card. In Competitive trial blocks, the AA chooses the black card with probability $1 - p$, while in Cooperative trial blocks, it chooses the black card with probability $p$. A cooperative choice of the AA is defined as an AA choice following the most likely target chosen by the participant. Thus, even in competitive trial blocks, the AA can make a cooperative choice. Since the algorithm needs to be initialized, we arbitrarily defined the first five trials as random (the AA plays the black target with probability 0.5). The possible history combinations that were not encountered during these initialization trials were assigned a probability of 0.5 of choosing the black target.
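This scheme can be summarized with a minimal Python sketch (ours, not the original implementation; the class and variable names are illustrative, and the second color label "white" is a placeholder):

```python
import random
from collections import defaultdict

class ArtificialAgent:
    """Tracks P(participant chooses black | last two choices and two outcomes)."""

    def __init__(self, n_init_trials=5):
        # counts[history] = [times black was chosen, total observations];
        # histories are hashable tuples of the last two choices and outcomes.
        self.counts = defaultdict(lambda: [0, 0])
        self.n_init_trials = n_init_trials
        self.trial = 0

    def p_black(self, history):
        black, total = self.counts[history]
        return 0.5 if total == 0 else black / total   # unseen history -> 0.5

    def choose(self, history, cooperative):
        if self.trial < self.n_init_trials:
            p = 0.5                                   # initialization trials are random
        else:
            p = self.p_black(history)
            if not cooperative:                       # Competitive block: play black with 1 - p
                p = 1.0 - p
        self.trial += 1
        return "black" if random.random() < p else "white"

    def observe(self, history, participant_choice):
        # Update the frequency table with the participant's actual choice.
        self.counts[history][1] += 1
        if participant_choice == "black":
            self.counts[history][0] += 1
```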

Description of computational models
The models described below are built to predict the probability of choosing one specific target, "a" or "b".
To match the terminology used in the behavioral analysis, we define the probability to stay as
$P_{\text{stay},t} = \begin{cases} P_t(a) & \text{if the choice at trial } t-1 \text{ was } a \\ P_t(b) & \text{if the choice at trial } t-1 \text{ was } b \end{cases}$

Reinforcement learning model
Reinforcement learning (RL) consists of directly linking an action or state and an outcome in order to predict future rewards after performing a particular action or being in a particular state. In our experiment, we updated the action value with the Rescorla-Wagner rule:
$V_{t+1}(a) = V_t(a) + \alpha\,\delta_t$
where $\alpha$ is the learning rate. The reward prediction error $\delta_t$ is defined as the difference between the reward at trial $t$, $r_t$, and the expected value of the choice $a$ at trial $t$, $V_t(a)$:
$\delta_t = r_t - V_t(a)$
Then the probability of choosing action $a$ is:
$P_t(a) = s\big(V_t(a) - V_t(b)\big)$
with $s(x) = \frac{1}{1 + e^{-x/\beta}}$ the sigmoid function, where $\beta$ is a free parameter capturing the stochasticity of the participant's behavior (i.e., the exploration/exploitation trade-off). We defined the probability to stay as above ($P_{\text{stay},t} = P_t(a)$ if the choice at trial $t-1$ was $a$, and $P_t(b)$ otherwise), and we used the same definition of the probability to stay for the other models.
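As an illustration, a minimal Python sketch of this update and choice rule (ours; the parameter values and function name are placeholders):

```python
import numpy as np

def simulate_rl_choice_probs(rewards, choices, alpha=0.3, beta=0.2):
    """Rescorla-Wagner value update with a sigmoid choice rule.

    rewards[t] : reward obtained on trial t (e.g., 1 for a win, 0 for a loss)
    choices[t] : 0 for target a, 1 for target b
    Returns the trial-by-trial probability of choosing target a and the
    probability of repeating the previous choice ("stay").
    """
    V = np.zeros(2)                      # values of targets a and b
    p_a, p_stay = [], []
    for t, (c, r) in enumerate(zip(choices, rewards)):
        # choice probability before the update: sigmoid of the value difference
        p = 1.0 / (1.0 + np.exp(-(V[0] - V[1]) / beta))
        p_a.append(p)
        if t > 0:
            p_stay.append(p if choices[t - 1] == 0 else 1.0 - p)
        # Rescorla-Wagner update of the chosen action's value
        delta = r - V[c]                 # reward prediction error
        V[c] += alpha * delta
    return np.array(p_a), np.array(p_stay)
```

For example, `simulate_rl_choice_probs([1, 0, 1, 1], [0, 0, 1, 1])` returns the model's trial-by-trial choice and stay probabilities under these placeholder parameters.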

Fictitious play
In game theory, one can infer the probability that the other chooses one particular action and select one's own action so as to maximize one's expected reward. This model is called a first-order fictitious play model. The opponent's probability $p^*$ of choosing an action $a$ is dynamically updated by tracking the opponent's choice history:
$p^*_{t+1}(a) = p^*_t(a) + \eta\,\delta^*_t$
where $\eta$ is the learning rate. The prediction error $\delta^*_t$ is defined as the difference between the other's actual choice on trial $t$ ($o_t = 1$ if the other's action is $a$, $o_t = 0$ if it is $b$) and the expected action of the opponent at trial $t$, $p^*_t(a)$. Then the probability of choosing action $a$ depends on the payoff matrix. In the competitive setting of our game, we can derive the probability $q$ that the participant chooses action $a$ = "black" using the sigmoid function, the payoff matrix, and the inferred probability $p^*$ that the other chooses $a$ = "black". Because the payoff matrix is the same for the participant in both Competitive and Cooperative trial blocks, the mode of interaction has no impact on the decision stage. However, the other's decision rule would be different under the competitive and the cooperative assumption. The fictitious agent then uses the inferred probability that the other chooses $a$, the payoff matrix and the other's temperature to compute a decision value.
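A minimal sketch of this first-order update and of the resulting decision under the task's win-if-match payoff (our own Python illustration; the exact decision equation used in the paper may differ):

```python
import numpy as np

def fictitious_play_probs(other_choices, eta=0.3, beta=0.2):
    """First-order fictitious play: track the other's choice frequency and
    favor the action that maximizes expected reward, given that the
    participant wins when both players pick the same color.

    other_choices[t] : 1 if the other chose target a on trial t, else 0
    Returns the trial-by-trial probability that the participant picks a.
    """
    p_star = 0.5                         # inferred P(other chooses a)
    q_a = []
    for o in other_choices:
        # expected value of matching on a vs. on b (win = 1, loss = 0)
        v_a, v_b = p_star, 1.0 - p_star
        q_a.append(1.0 / (1.0 + np.exp(-(v_a - v_b) / beta)))
        # delta-rule update of the inferred opponent policy
        p_star += eta * (o - p_star)
    return np.array(q_a)
```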

Influence model
Another strategy is to take into account how one's own actions influence the other's future actions. Thus, to compute the update of the other's strategy, we inserted the update of the opponent's strategy (Eq. 4) into the player's decision rule (Eq. 7). Then, with a Taylor expansion around 0, we added the influence terms (the influence update signal of the participant and the influence update signal of the other) to the update of $p^*_{t+1}$.
To compute these second-order (influence) prediction errors, we invert the decision function (Eq. 7). As for the fictitious agent, the influence learner uses the inferred probability that the other chooses $a$, the payoff matrix and the other's temperature to compute a decision value (Eq. 8).

k-ToM model
The k-ToM model is defined as in ref. 1. In game theory, an economic game is defined by a utility table $U(a_{\text{self}}, a_{\text{other}})$ representing the payoff to the players according to the actions of the self ($a_{\text{self}}$) and of the other player ($a_{\text{other}}$). In our experiment, this utility table varies between Competitive and Cooperative blocks (see Fig. 1a). Because participants make a binary choice, $a_{\text{self}}$ and $a_{\text{other}}$ take the value $a = 0$ for one option and $a = 1$ for the other option. According to Bayesian decision theory, agents try to maximize their expected value $V = E[U(a_{\text{self}}, a_{\text{other}})]$. We assume that agents use a softmax function as a decision rule:
$P(a_{\text{self}} = 1) = s\big(\beta\,[V(1) - V(0)]\big)$   (Eq. 13)
where $P(a_{\text{self}} = 1)$ is the probability that the agent chooses option $a_{\text{self}} = 1$, $s$ is the sigmoid function, and $\beta$ is a free parameter called the inverse temperature, which controls the magnitude of behavioral noise.
The value of each action depends on the probability of the other's action, $p_{\text{other}} = P(a_{\text{other}} = 1)$, and on the utility table $U(a_{\text{self}}, a_{\text{other}})$:
$V(a) = p_{\text{other}} \cdot U(a_{\text{self}} = a, a_{\text{other}} = 1) + (1 - p_{\text{other}}) \cdot U(a_{\text{self}} = a, a_{\text{other}} = 0)$   (Eq. 14)
One key hypothesis of this model is that the other agent is itself considered to be a k-ToM agent: the other agent has the same decision policy as in Equation 13. Thus, while the agent tracks $p_{\text{other}}$, the other tracks $p_{\text{self}}$, which constructs a recursive reasoning. This recursion induces distinct levels of ToM sophistication between the two agents, impacting how agents update their subjective prediction of $p_{\text{other}}$ (ref. 1). k-ToM agents are defined according to the way they update this prediction of $p_{\text{other}}$, starting from 0-ToM. The definition of higher levels of reasoning is based on level 0, for which $P(a_{\text{other}} = 1) = s(x^0_t)$, with the log-odds $x^0_t$ varying with a volatility $\sigma^0$. The updating rule for the hidden state $x^0_t$ follows a Bayes-optimal probabilistic scheme:
$p(x^0_{t+1} \mid a^{1:t+1}_{\text{other}}) \propto p(a^{t+1}_{\text{other}} \mid x^0_{t+1}) \int p(x^0_{t+1} \mid x^0_t)\, p(x^0_t \mid a^{1:t}_{\text{other}})\, dx^0_t$
with $p(x^0_{t+1} \mid x^0_t)$ the 0-ToM prior belief on the volatility of the log-odds, and $p(x^0_t) \equiv p(x^0_t \mid a^{1:t}_{\text{other}})$ the posterior belief about the log-odds $x^0_t$ at trial $t$ after the observation of all previous actions $a_{\text{other}}$.
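To make Eqs. 13-14 concrete, here is a short Python illustration (ours; the inverse temperature and the example belief are placeholders, and the example payoff table reflects the task's win-when-colors-match rule):

```python
import numpy as np

def ktom_choice_prob(U, p_other, beta=1.0):
    """Expected value of each action given a utility table and the belief
    p_other = P(a_other = 1), followed by a sigmoid (softmax) decision rule.

    U[a_self][a_other] : payoff table (varies between Cooperative and
                         Competitive blocks in the task)
    Returns P(a_self = 1).
    """
    # Eq. 14: V(a) = p_other * U(a, 1) + (1 - p_other) * U(a, 0)
    V = [p_other * U[a][1] + (1.0 - p_other) * U[a][0] for a in (0, 1)]
    # Eq. 13: sigmoid of the value difference, scaled by the inverse temperature
    return 1.0 / (1.0 + np.exp(-beta * (V[1] - V[0])))

# Example: the participant's payoff in this task (win when both colors match)
U_participant = [[1, 0],
                 [0, 1]]
p = ktom_choice_prob(U_participant, p_other=0.7, beta=2.0)  # probability of option 1
```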
Thus, one can derive the 0-ToM learning rule (Eq. 18), where $\mu^0_t$ is the approximate mean of the 0-ToM posterior distribution over $x^0$ and $\Sigma^0_t$ is its approximate variance. Thus, $\mu^0_t$ is the estimate of the 0-ToM log-odds at trial $t$ and $\Sigma^0_t$ her subjective uncertainty about it.
A 1-ToM agent assumes that the other agent reasons with a 0-depth ToM. Thus, from the decision policy of a 0-ToM agent one can construct a 1-ToM agent. More specifically, combining Equations 13, 14 and 15, the 1-ToM agent assumes that the probability that a 0-ToM agent emits action $a_{\text{other}} = 1$ is $p_{\text{other}} = s \circ V^1(x^1_t)$ (we use the symbol $\circ$ to refer to the composition of two functions, defined as $(g \circ f)(x) = g(f(x))$), with $s$ the sigmoid function and $V^1$ the value for the 0-ToM agent of choosing option 1, where $\Delta U(a) = U(a_{\text{self}} = a, a_{\text{other}} = 1) - U(a_{\text{self}} = a, a_{\text{other}} = 0)$ represents the incentive to choose option 1 when the 1-ToM agent chooses option $a_{\text{self}} = a$. The hidden state $x^1_t$ for a 1-ToM agent plays the same role as $x_{\text{other}}$ does for the 0-ToM agent. To let the 1-ToM agent eventually learn how the 0-ToM agent learns about her, and act accordingly, the 1-ToM agent assumes that the hidden states $x^1_t$ vary across trials with volatility $\sigma^1$, which leads to a meta-learning rule similar to Equations 16, 17 and 18.
More generally, an agent of depth $k \geq 2$ considers that the other agent is a κ-ToM agent of lower sophistication (κ < k), but this sophistication has to be learned in addition to the hidden states $x^\kappa$ that track the opponent's learning and decision making. Thus, a k-ToM agent continuously tracks all possible sophistication levels of the other and the associated probability $p_{\text{other},\kappa} = s \circ V^\kappa(x^\kappa_t)$ that the other will choose $a_{\text{other}} = 1$.

Eq 25
where $\lambda_\kappa$ is k-ToM's probability that her opponent is κ-ToM, and $\partial V^\kappa / \partial x^\kappa$ is the gradient of $V^\kappa$ with respect to the hidden states $x^\kappa$. Here, $V^\kappa$ is obtained by the recursive injection of Equation 5 into Equation 1, as was already done to obtain Equation 4. $\tilde{V}^\kappa$ is defined in terms of the expectation operator: $E[s \circ V^\kappa(\mu_{t-1,\kappa}, \Sigma_{t-1,\kappa})] = s \circ \tilde{V}^\kappa(\mu_{t-1,\kappa}, \Sigma_{t-1,\kappa})$. Equations 3 and 5 were estimated using a variational approach to approximate Bayesian inference.

Two Experts model
For the models that use the payoff matrix to update hidden states, and therefore differentiate the Cooperative and Competitive modes (i.e., k-ToM and the influence model), we fitted three variants corresponding to different settings: competitive, cooperative, or mixed intentions. When considering the mixed-intentions models (k-ToM and influence models), we made the assumption that the cooperative expert and the competitive expert come from the same model (i.e., influence model or k-ToM) because, from the point of view of the participants, there is no indication that there are two modes of interaction.
Therefore, it is more parsimonious to assume that a single process (i.e., the same computational model for both experts) is engaged throughout the task.
For the mixed-intentions setting, we ran the competitive and cooperative models in parallel, avoiding the need for the payoff matrix to be learnt. On the first trial, each expert assigns a prior probability that the other chooses action $a$ ($p^*_{\text{coop},0} = p^*_{\text{comp},0} = 0.5$); then each expert follows its own updating scheme, generating on each trial the probability that the other will choose option $a$ under each possible mode of interaction, $P^{\text{comp}}_a$ and $P^{\text{coop}}_a$. We then transformed these probabilities with the inverse sigmoid (logit) function to obtain values $x_i$ ranging from $-\infty$ to $+\infty$. In our binomial choice configuration, $x_a = -x_b$ in both the competitive and the cooperative setting. Thus, as $x_{\text{coop}}$ and $x_{\text{comp}}$ get close to zero, the uncertainty about $i$, the other's intention, increases. We defined the reliability of intention $i$ as the absolute value of $x_i$, and the probability that the other's intention is cooperative as the sigmoid function of the difference in reliability between the two modes:
$P(\text{coop}) = s\big(\beta\,(|x_{\text{coop}}| - |x_{\text{comp}}|) + b\big)$   (Eq. 26)
where $\beta$ is the inverse temperature controlling the stochasticity of the mode of interaction and $b$ is the bias towards the cooperative mode. To motivate our definition of the reliability signal, we tested four alternative definitions of the reliability signal for the winning model. Finally, the reward prediction error was defined as the difference between the reward at trial $t$ for action $a$ and its expected value (Eq. 29).
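A minimal Python sketch of this reliability-based arbitration (ours; the logit transform and the reliability definition follow the text, while the weighted mixture in the last line is one possible read-out that is not taken from the paper):

```python
import numpy as np

def arbitrate(p_coop_a, p_comp_a, beta=1.0, bias=0.0):
    """Reliability-based arbitration between the cooperative and competitive
    experts (Eq. 26), with placeholder parameter values.

    p_coop_a, p_comp_a : each expert's probability that the other picks 'a'
    Returns (P(cooperative mode), combined probability that the other picks 'a').
    """
    logit = lambda p: np.log(p / (1.0 - p))      # maps (0, 1) onto (-inf, +inf)
    x_coop, x_comp = logit(p_coop_a), logit(p_comp_a)
    # Reliability of an intention = |log-odds|; values near 0 mean high uncertainty.
    rel_coop, rel_comp = abs(x_coop), abs(x_comp)
    # Probability that the other's intention is cooperative (Eq. 26).
    p_mode = 1.0 / (1.0 + np.exp(-(beta * (rel_coop - rel_comp) + bias)))
    # Illustrative read-out: weighted mixture of the two experts' predictions.
    p_combined = p_mode * p_coop_a + (1.0 - p_mode) * p_comp_a
    return p_mode, p_combined
```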

Active inference model
For this model, based on ref. 5, we adopted the partially observable Markov decision process (POMDP) framework, which describes transitions among states under the hypothesis that the probability of the next state depends only on the current state. The partially observed aspect of the Markovian process means that states are not directly observable and have to be inferred through a set of (noisy) observations. Active inference is specified by a tuple (P, Q, R, S, A, U, Ω):
• Ω is a finite set of possible observations
• A is a finite set of possible actions
• S is a finite set of hidden states
• U is a finite set of control states
• R is the generative process over observations $\tilde{o} \in \Omega$, hidden states $\tilde{s} \in S$, and actions $\tilde{a} \in A$: $R(\tilde{o}, \tilde{s}, \tilde{a}) = \Pr(\{o_0, \dots, o_t\} = \tilde{o},\ \{s_0, \dots, s_t\} = \tilde{s},\ \{a_0, \dots, a_{t-1}\} = \tilde{a})$
• P is the generative model over observations $\tilde{o} \in \Omega$, hidden states $\tilde{s} \in S$, and control states $\tilde{u} \in U$: $P(\tilde{o}, \tilde{s}, \tilde{u} \mid m) = \Pr(\{o_0, \dots, o_t\} = \tilde{o},\ \{s_0, \dots, s_t\} = \tilde{s},\ \{u_0, \dots, u_t\} = \tilde{u})$, with parameters as defined in ref. 5
• Q is the approximate posterior over hidden and control states, $Q(\tilde{s}, \tilde{u}) = \Pr(\{s_0, \dots, s_t\} = \tilde{s},\ \{u_0, \dots, u_t\} = \tilde{u})$, with parameters (expectations) $(\hat{s}, \hat{u})$, where $\pi \in \{1, \dots, K\}$ is a policy that indexes a sequence of control states.
First, the generative process describes the transition probabilities among hidden states, which generate observations. Transition probabilities depend on actions, which are sampled from the approximate posterior beliefs about control states. Beliefs are formed using the generative model (denoted by m) of how observations are generated by hidden states. The generative model encodes the agent's beliefs about hidden states in terms of expectations.
The active inference model assumes that both action and expectation minimize the free energy of observations. That is, expectations minimize free energy, and expectations about control states prescribe actions on each trial:
$(\hat{s}, \hat{u}) = \arg\min F(\tilde{o}, s, \hat{u})$   (Eq. 30)
$\Pr(a_t = u_t) = Q(u_t \mid \hat{s}^*)$   (Eq. 31)
with:
$F(\tilde{o}, s, \hat{u}) = E_Q[\,-\ln P(\tilde{o}, \tilde{s}, \tilde{u} \mid m)\,] - H[Q(\tilde{s}, \tilde{u})] = -\ln P(\tilde{o} \mid m) + D_{KL}[\,Q(\tilde{s}, \tilde{u}) \,\|\, P(\tilde{s}, \tilde{u} \mid \tilde{o})\,]$   (Eq. 32)
The generative model can be defined as three marginal distributions:
$P(\tilde{o}, \tilde{s}, \tilde{u} \mid m) = P(\tilde{o} \mid \tilde{s})\, P(\tilde{s} \mid \tilde{u})\, P(\tilde{u} \mid m)$   (Eq. 33)
Thus, heuristically, the decision consists first of figuring out which state is the most likely by optimizing its expectation according to the free energy and the generative model. Then, after optimizing its posterior beliefs, an action is sampled from the posterior probability distribution over control states. The environment generates a new observation given the selected action using the generative process, and a new decision cycle begins.
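As a sanity check on the decomposition in Eq. 32, the following toy Python example (ours; the probability values are arbitrary and a single discrete observation is assumed) verifies that the energy-minus-entropy form equals evidence plus divergence:

```python
import numpy as np

# Toy check of the free-energy decomposition (Eq. 32) for one observed o*:
# F = E_Q[-ln P(o, s)] - H[Q]  ==  -ln P(o) + KL[ Q(s) || P(s | o) ]
P_joint = np.array([0.15, 0.05, 0.20, 0.10])   # P(o*, s) over 4 hidden states
P_o     = P_joint.sum()                        # model evidence P(o*)
P_post  = P_joint / P_o                        # posterior P(s | o*)
Q       = np.array([0.25, 0.25, 0.25, 0.25])   # arbitrary approximate posterior

F_energy_entropy = np.sum(Q * -np.log(P_joint)) - np.sum(Q * -np.log(Q))
F_evidence_kl    = -np.log(P_o) + np.sum(Q * np.log(Q / P_post))
assert np.isclose(F_energy_entropy, F_evidence_kl)   # both forms agree
```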

lose"}
We defined 20 hidden states: S = {"previous choice" × "previous reward" × "current correct answer" × "current mode of interaction"; "current target a, win"; "current target a, lose"; "current target b, win"; "current target b, lose"}. The finite set of actions is A = {"choose target b"; "choose target a"}, which brings the agent from the first 16 possible hidden states to the corresponding last 4 hidden states, namely {"current target a, win"; "current target a, lose"; "current target b, win"; "current target b, lose"}.
The log prior preferences over the observed states are C = [0; 1; 0; 1; 2; −1; 2; −1], meaning that the agent prefers to observe, in decreasing order, a win on the current trial, then a win on the previous trial, then a loss on the previous trial, and finally a loss on the current trial.
For each trial, prior beliefs about hidden states are spread equally over the four hidden states defined by {"previous choice" × "previous reward"}, leaving the "current mode of interaction" and the "current correct answer" unknown.
To allow the agent to learn about the hidden state "current mode of interaction", we added concentration parameters over observations. Concentration parameters are priors on which hidden states lead to which observations, and can be viewed as the number of occurrences of the corresponding hidden state encountered in the past. We arbitrarily set this number to 2 for being in the "cooperative" mode when observing a previous win, and to 1 for being in the "competitive" hidden state. Conversely, for a previous loss, we set this parameter to 1 for the "cooperative" hidden state and to 2 for the "competitive" state.
with $o_{1:n}$ the sequence of the $n$ last outcomes. For each observation at time $t$, the update follows a Laplace-Kalman rule.
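For concreteness, a small Python sketch of this state space, the prior preferences and the concentration parameters (ours; the state labels are illustrative shorthand, not the labels used in the original implementation):

```python
from itertools import product

# Enumerate the 16 "context" hidden states plus the 4 outcome states (20 total),
# following the factorization described above.
prev_choice = ["prev a", "prev b"]
prev_reward = ["prev win", "prev lose"]
correct     = ["correct a", "correct b"]
mode        = ["cooperative", "competitive"]
context_states = list(product(prev_choice, prev_reward, correct, mode))   # 16 states
outcome_states = ["current a, win", "current a, lose",
                  "current b, win", "current b, lose"]                    # 4 states
hidden_states  = context_states + outcome_states                          # 20 states

actions = ["choose target a", "choose target b"]

# Log prior preferences over observations, as given in the text: the agent
# prefers (in decreasing order) a current win, a previous win, a previous loss,
# and finally a current loss.
C = [0, 1, 0, 1, 2, -1, 2, -1]

# Concentration parameters linking the "mode of interaction" hidden state to
# previous outcomes: a previous win counts 2 for "cooperative" and 1 for
# "competitive"; a previous loss counts 1 for "cooperative" and 2 for "competitive".
concentration = {
    ("prev win",  "cooperative"): 2, ("prev win",  "competitive"): 1,
    ("prev lose", "cooperative"): 1, ("prev lose", "competitive"): 2,
}
```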

Evidence for separate cooperative and competitive experts
Below, we demonstrate the distinguishability and independence of the two experts at the computational and behavioral levels.
First, it should be noted that the predictions of the cooperative expert are not always completely anti-correlated with the predictions of the competitive expert. Indeed, the model was designed in a way that allows the two experts to be correlated, anti-correlated, or not correlated at all. The two experts start with the same prediction about the choice of the participant. First, if the second-order prediction error (PE) remains negligible, the strategies of both experts will be correlated throughout the game (because the first-order PE is identical for both experts). Second, if the two PEs (i.e., first and second order) are of the same order of magnitude, the cooperative and competitive strategies will not be correlated at all. Third, if the second-order PE is globally high with respect to the first-order PE, the two strategies will be anti-correlated. These three cases demonstrate the validity of the 'mixture of experts' framework for modeling our task, since the two experts are not structurally anti-correlated. Thus, our model cannot be reduced to a single expert that either cooperates or competes.
Second, Supplementary Figure 1 displays the correlations between the predictions of the two experts for each of the 31 participants. They show either a correlation, no correlation, or an anti-correlation between the predictions of the two experts. Thus, the two separate experts can be distinguished and they are not anti-correlated. Choice probabilities of the two experts were significantly correlated at the group level: R² = 0.21 (CI [0.10; 0.36]), p < 0.0001. However, this correlation is not forced by the equations of the model (i.e., it is conjectural, not structural), since it depends on the free parameters and on the interaction between the choices of the participants and the choices of the AA.
Third, we observed no correlation between the reliability of the cooperative expert and the reliability of the competitive expert (R² = 0.005, p = 0.148) (Supplementary Figure 2). This indicates that the two strategies (cooperative and competitive) are not reliable at the same time. It also demonstrates the importance of the second-order prediction error term in differentiating the two strategies.
Fourth, the cooperative and competitive components of the decision value are correlated with only 0.4% of common variance (R² = 0.0044, p = 0.001) (Supplementary Figure 3). Note that when the decision value computed by the cooperative expert (respectively, the competitive expert) is around 0, the variance of the decision value of the competitive expert (respectively, the cooperative expert) is large. This indicates that when one expert is reliable, the other is often unreliable. Globally, when one expert makes a precise prediction regarding the best future choice for the participant, the other expert often proposes a less reliable choice. Thus, overall, the two experts are complementary.


Heuristic model
This model reproduces a heuristic behavior, namely "I keep the same option if I just won, I switch if I just lost". To implement it, we use two pseudo Q-values, $Q_{\text{stay}} = 1$ for the action of staying and $Q_{\text{switch}} = -1$ for the action of switching. We then use the sigmoid function to compute the probability of choosing the same option as on the previous trial:
$p_{\text{stay}} = s(Q_{\text{stay}} - Q_{\text{switch}})$   (Eq. 44)

Model recovery
[...] participant observed during the experiment). We found that no model produced behavior that could be confounded with the winning model (the mixed-intentions influence model). Moreover, the behavior generated by the mixed-intentions influence model could not be recovered by another model in our model set (see the confusion matrix in Supplementary Figure 5).

Table 4. Brain regions that responded differently to the reward prediction error for trials estimated to be competitive rather than cooperative.
No brain region.**
** Clusters reported at p < 0.05 FWE whole-brain cluster-corrected (initial cluster-forming threshold of p < 0.001 uncorrected).